Statistical machine translation processing

ABSTRACT

A method of statistical machine translation (SMT) is provided. The method comprises generating reordering knowledge based on the syntax of a source language (SL) and a number of alignment matrices that map sample SL sentences with sample target language (TL) sentences. The method further comprises receiving a SL word string and parsing the SL word string into a parse tree that represents the syntactic properties of the SL word string. The nodes on the parse tree are reordered based on the generated reordering knowledge in order to provide reordered word strings. The method further comprises translating a number of reordered word strings to create a number of TL word strings, and identifying a statistically preferred TL word string as a preferred translation of the SL word string.

RELATED APPLICATION(S)

This application is a Continuation of, and claims the benefit of, U.S. patent application Ser. No. 11/977,133 that was filed on Oct. 23, 2007 and that is incorporated herein by reference in its entirety.

TECHNICAL FIELD

The technology relates to the field of statistical machine translation, and related translation endeavors.

BACKGROUND

Prior to modern computer capabilities, almost all translations were performed by a human translator. For example, during a conference of parties speaking different languages, it was common for a member to communicate with a human translator who could translate for both parties. However, due to the advancement of computer capabilities, statistical machine translation (SMT) has become more available.

In general, SMT involves the translation of text from a source language (SL) to a target language (TL), generally by utilizing a computing system to carry out machine translation operations. Many modern SMT systems involve the use of computer software to translate text or speech. The relatively high speed with which modern computer systems can process large quantities of data makes SMT a powerful tool for quickly translating large volumes of text.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.

A method of SMT is provided wherein the method comprises generating reordering knowledge based on the syntax of a SL and a number of alignment matrices that map sample SL sentences with sample TL sentences. The method further comprises receiving a SL word string and parsing the SL word string into a parse tree that represents the syntactic properties of the SL word string. The nodes on the parse tree are reordered based on the generated reordering knowledge in order to provide reordered word strings. The method further comprises translating a number of reordered word strings to create a number of TL word strings, and identifying a statistically preferred TL word string as a preferred translation of the SL word string.

DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments of the technology for SMT processing and, together with the description, serve to explain principles discussed below:

FIG. 1 is a flowchart of an exemplary translational process used in accordance with an embodiment of the present technology for translating an input string into an output string.

FIG. 2 is a block diagram of an exemplary processing environment used in accordance with an embodiment of the present technology for processing an input source language string.

FIG. 3 is a diagram of an exemplary alignment matrix used in accordance with an embodiment of the present technology for aligning source language terms with target language terms.

FIG. 4 is a diagram of an exemplary reordering paradigm used in accordance with an embodiment of the present technology for reordering terms in a received sequence of terms.

FIG. 5 is a diagram of an exemplary computer system used in accordance with an embodiment of the present technology for statistical machine translation (SMT).

The drawings referred to in this description should be understood as not being drawn to scale except if specifically noted.

DETAILED DESCRIPTION

Reference will now be made in detail to embodiments of the present technology for SMT processing, examples of which are illustrated in the accompanying drawings. While the technology for SMT processing will be described in conjunction with various embodiments, it will be understood that they are not intended to limit the present technology for SMT processing to these embodiments. On the contrary, the presented technology for SMT processing is intended to cover alternatives, modifications and equivalents, which may be included within the spirit and scope of the various embodiments as defined by the appended claims.

Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a thorough understanding of the present technology for SMT processing. However, the present technology for SMT processing may be practiced without these specific details. In other instances, well known methods, procedures, components, and circuits have not been described in detail so as not to unnecessarily obscure aspects of the present embodiments.

It is understood that discussions throughout the present detailed description that utilize terms such as “using”, “utilizing”, “implementing”, “mapping”, “matching”, “representing”, “analyzing”, “communicating”, “receiving”, “performing”, “generating”, “enabling”, “presenting”, “configuring”, “training”, “identifying”, “calculating”, “inverting”, “ranking”, “parsing”, “preprocessing”, “translating”, “ordering”, “reordering”, “providing”, “acquiring”, and “accessing”, or the like, may refer to the actions and processes of a computer system, or similar electronic computing device. The computer system or similar electronic computing device manipulates and transforms data represented as physical (electronic) quantities within the computer system's registers and memories into other data similarly represented as physical quantities within the computer system memories or registers or other such information storage, transmission, or display devices. The present technology for SMT processing is also well suited to the use of other computer systems such as, for example, optical and mechanical computers.

Overview

In general, “machine translation” is a translation performed by a computer. While many may believe that modern machine translation processes entail one or more engineers painstakingly inputting a set of translation knowledge, such as may be found in a bilingual dictionary, into a computer, what makes SMT outstanding is that it does not rely on the manual inputting of translation knowledge. Rather, a comprehensive volume of text may be provided to the machine translation system, and the system can then “guess” the translation knowledge itself by implementing sophisticated statistical methods.

Thus, SMT is a process that is generally used by a machine translation system to translate a word string in a first natural language into a word string in a second natural language. For instance, words associated with a TL may be substituted for equivalent words in a SL string in order to generate a new string of text in the TL. However, merely translating terms in a string of text and outputting these terms in the same order that they were originally presented in the SL string may not provide a preferred translation when the syntax of the SL differs from that of the TL.

For instance, the English phrase “tall man” is equivalent to the Spanish phrase “hombre alto”. Thus, it is not difficult to see that merely translating the terms “tall” and “man” in the English phrase into their Spanish equivalents, in their original order, will not generate the corresponding Spanish phrase “hombre alto”. Rather, a process of word reordering may be needed, especially when translating longer and more complex strings, to reorder terms in a SL string such that multiple new term sequences are generated, and wherein a preferred translated term sequence is identified as a correct TL translation of the SL string.

It is understood that embodiments of the present technology provide a means of rearranging words in a SL string by implementing a reordering schema comprising a one-to-many transformation. For instance, a parse tree of a sentence that is to be translated may be generated, and the words of the sentence could be rearranged pursuant to a syntactic structure associated with the generated parse tree. Various other embodiments teach preprocessing a SL string such that the words of the string are rearranged into multiple distinct orders before a subsequent translation process is implemented. In this manner, multiple reordered term sequences are identified, and one of these term sequences may be selected based on an implemented probabilistic assessment of each of the sequences.

Generally, a SMT process comprises two primary stages. First, a training process is implemented wherein the machine translation system builds up its own translation knowledge. Second, a decoding process is implemented wherein a received SL string is translated into a corresponding TL string.

According to an embodiment of the present technology, during the training stage, the machine translation system is provided a set of training data, which comprises a comprehensive collection of sentence pairs. It is understood that a sentence pair comprises a sentence in a SL and a corresponding sentence in a TL. A word alignment process is applied to the training data, wherein for each sentence pair, a word alignment model produces an alignment matrix that indicates which SL words correspond to which TL words. By summing up all of the alignment matrices of all sentence pairs in the training data, a translation table can be obtained. The results of the alignment model can then be forwarded to a translation database used to store how words and phrases in a SL translate into words and phrases in a TL.
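To illustrate how summing alignment matrices over sentence pairs could yield a translation table, a minimal sketch is given below. It assumes that each alignment matrix is represented as a set of (SL index, TL index) links, and that the summed counts are normalized to relative frequencies; the function name `build_translation_table`, the data layout, and the normalization step are illustrative assumptions and are not specified by the present description.

```python
from collections import Counter, defaultdict


def build_translation_table(aligned_pairs):
    """Sum alignment links over all sentence pairs (hypothetical helper).

    aligned_pairs: iterable of (sl_words, tl_words, links), where links is a
    set of (i, j) pairs meaning sl_words[i] is aligned with tl_words[j].
    """
    counts = defaultdict(Counter)
    for sl_words, tl_words, links in aligned_pairs:
        for i, j in links:
            counts[sl_words[i]][tl_words[j]] += 1

    # Assumption: turn the summed counts into relative frequencies.
    table = {}
    for sl_word, tl_counts in counts.items():
        total = sum(tl_counts.values())
        table[sl_word] = {tl: c / total for tl, c in tl_counts.items()}
    return table


# Toy training data: two sentence pairs with their alignment links.
corpus = [
    (["hombre", "alto"], ["tall", "man"], {(0, 1), (1, 0)}),
    (["un", "hombre"], ["a", "man"], {(0, 0), (1, 1)}),
]
table = build_translation_table(corpus)
```

In this toy corpus, “hombre” is aligned with “man” in both sentence pairs, so the resulting entry for “hombre” places all of its weight on “man”.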

After the training process has been completed, the machine translation system may begin a subsequent decoding process in order to identify a preferred translation of a received SL string. With reference now to FIG. 1, an exemplary translational process 200 used in accordance with an embodiment of the present technology for translating an input string 210 into an output string 220 is shown. In the present embodiment, the input string 210 is comprised of multiple SL terms, while the output string 220 is comprised of one or more terms associated with a TL. The translational process 200 comprises reordering the terms of the input string 210 into different orders in order to generate a plurality of SL strings 230. The SL strings 230 correspond to a plurality of TL strings 240 which are the TL equivalents of the plurality of SL strings 230.

With reference still to the embodiment illustrated in FIG. 1, a target word string is identified from among the plurality of TL strings 240 as a correct translation of the input string 210. This target word string may be selected based on a set of reordering probabilities associated with the SL strings 230, as well as a scoring metric generated during a decoding stage of the SMT process. The identified word string may then be outputted as the output string 220.

For example, a probability of reordering the input string 210 as a specific SL string from among the plurality of SL strings 230 could be calculated for each string among the plurality of SL strings 230. Next, the calculated reordering probabilities could be used to generate a decoding score for each of the plurality of TL strings 240. A statistically preferred TL string could then be identified based on the generated decoding scores.
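The selection step described above can be sketched as follows. The present description does not state how a reordering probability is combined with a decoder score, so this sketch assumes, purely for illustration, a weighted sum in log space; the function name `pick_preferred`, the parameter `reorder_weight`, and all numeric values are hypothetical.

```python
import math


def pick_preferred(candidates, reorder_weight=1.0):
    """Return the TL string with the highest combined score.

    candidates: list of (tl_string, reorder_prob, decoder_log_score).
    Assumption: score = reorder_weight * log(reorder_prob) + decoder_log_score,
    with reorder_prob strictly greater than zero.
    """
    def score(candidate):
        _, reorder_prob, decoder_log_score = candidate
        return reorder_weight * math.log(reorder_prob) + decoder_log_score

    return max(candidates, key=score)[0]


# Made-up candidates: each TL string comes from one reordered SL string.
candidates = [
    ("a man tall", 0.27, -4.0),
    ("a tall man", 0.63, -2.5),
]
preferred = pick_preferred(candidates)
```

Here the second candidate is preferred because both its reordering probability and its decoder score are higher; other combination schemes could equally be used.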

Therefore, it is understood that various embodiments of the present technology provide a dynamic approach to term reordering that allows many different SL strings to be considered as possible reordered string structures during the overall translation process, and wherein a target word string may be selected as a preferred translation of an input word string based on an assessment of calculated reordering probabilities and decoding scores. Moreover, various embodiments of the present technology present a novel approach to long-distance word reordering by arranging word orders by operations on a parse tree of a sentence or word string that is to be translated from a SL into a TL.

It is further understood that various exemplary embodiments of the present technology are described in the context of SMT processing. However, various embodiments of the present technology are also well suited to be used for machine mapping applications, as well as various probabilistic analysis and selection processes. Indeed, the present technology is also useful for manual translation applications, or processes involving the translation of non-textual subject matter. The embodiments described below are in the context of SMT processing for purposes of illustration and are not intended to limit the spirit or scope of the present technology.

String Preprocessing

With reference now to FIG. 2, an exemplary processing environment 300 for processing an input SL string 310 in accordance with an embodiment of the present technology is shown. The processing environment 300 comprises a parsing module 320 that receives the input SL string 310 and generates a parsed SL string 330. The parsed SL string 330 is then received by a reordering stage 340 which generates multiple reordered SL strings 350 which may then be analyzed and further processed during a comprehensive SMT process.

The parsing module 320 in FIG. 2 receives the input SL string 310 and parses the received input SL string 310 into a parse tree comprising multiple nodes, wherein the parse tree represents the syntactic structure of the input SL string 310 according to the SL syntax. For instance, the parsing module 320 could be configured to implement a syntactic analysis to parse the input SL string 310 into multiple nodes, wherein each node corresponds to a word or phrase found in the input SL string 310. The parsing module 320 could then divide the terms of the input SL string 310 into groups corresponding to syntactic components of the input SL string 310.

It is understood, however, that the aforementioned method of parsing the input SL string 310 is merely exemplary, and is not meant to narrow the scope of the present technology. Indeed, other methods of dividing the input SL string 310 into multiple components or nodes may be utilized.

With reference still to FIG. 2, the parsing module 320 generates a parsed SL string 330, which is a parse tree comprising the nodes generated during the parsing process. This parsed SL string 330 is then received by the reordering stage 340 which is configured to reorder the nodes of the parsed SL string 330 into different orders or sequences pursuant to a reordering model, such as a linguistic syntax-based reordering model. The reordering stage 340 is further utilized to generate multiple reordered SL strings 350 based on these different orders or sequences.

The reordering stage 340 comprises a preprocessing module 341 configured to receive the parsed SL string 330 and access the nodes of the parsed SL string 330. The preprocessing module 341 is further configured to analyze each node in the parse tree of the input SL string 310 in view of a reordering model that utilizes the syntax of the SL to determine if such nodes may be reordered so as to create a new sequence. For instance, the preprocessing module 341 will first analyze the top node of the parse tree, wherein the top node is itself a parent node, to determine, in view of data provided by the reordering model, how likely it is that the child nodes of the top node are to be inverted or reordered. The preprocessing module 341 could next analyze one of the aforementioned child nodes to determine if such child node is itself a parent node of two or more distinct child nodes, and then these other child nodes could be scrutinized using the reordering model to determine how likely it is that they are to be inverted or reordered. This process may continue until the preprocessing module 341 has reached the bottom of the parse tree.
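The top-down analysis described above might be sketched as follows. The `Node` class, the function `visit_top_down`, the node labels, and the stand-in reordering model `toy_model` are all hypothetical names and values introduced for illustration; a real reordering model would be derived from training data rather than hard-coded.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: str = ""


def visit_top_down(root, inversion_prob):
    """Visit parent nodes from the top of the tree to the bottom.

    Returns (label, probability) for every node with two or more children,
    where inversion_prob is a stand-in for the reordering model.
    """
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        if len(node.children) >= 2:
            results.append((node.label, inversion_prob(node)))
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return results


# Toy parse tree for "un hombre alto" (the labels are assumptions).
tree = Node("NP", [
    Node("DET", word="un"),
    Node("NBAR", [Node("N", word="hombre"), Node("ADJ", word="alto")]),
])


def toy_model(node):
    # Made-up probabilities standing in for learned reordering knowledge.
    return 0.8 if node.label == "NBAR" else 0.1


visited = visit_top_down(tree, toy_model)
```

Leaf nodes are skipped because they have no children to invert; only the top node and the nested parent node are scored.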

The functionality of the preprocessing module 341, pursuant to various exemplary embodiments, will now be explained in further detail so as to illustrate various principles of the present technology. It is understood, however, that the embodiments described herein are provided for purposes of illustration and are not intended to limit the spirit or scope of the present technology.

The preprocessing module 341 utilizes a reordering knowledge 345 to reorder the terms of the parsed SL string 330, wherein the reordering knowledge 345 is based on the syntax of the SL and a number of alignment matrices that map sample SL sentences with sample TL sentences. The reordering knowledge 345 enables the preprocessing module 341 to determine whether one or more nodes in the parsed SL string 330 are to be reordered or not. This reordering knowledge 345 is acquired by collecting a set of training samples.

For instance, a set of training samples could be collected that comprise SL sentences paired with TL sentences. These sentence pairs could then be analyzed so as to compare how syntactic rules associated with word placement in the SL are different from the syntactic rules of the TL. In this manner, the collected training samples may be utilized to determine how words and phrases in the SL translate into words and phrases in the TL.

It is understood, however, that various methods exist for acquiring the aforementioned reordering knowledge 345. With reference still to FIG. 2, in one embodiment, the reordering stage 340 further comprises a training database 342. The training database 342 is used to store a set of training data, which comprises a collection of sentence pairs. A sentence pair comprises a sentence in a SL and a corresponding sentence in a TL. It is understood that this training database 342 may be accessed in order to obtain information pertaining to the stored sentence pairs.

Moreover, the preprocessing module 341 implements a training instance acquisition process 343 to acquire information needed in the reordering process. The objective of the training instance acquisition process 343 is to allow the preprocessing module 341 to design a form of reordering knowledge 345 that can be directly applied to parse tree nodes. The preprocessing module 341 accesses the training database 342 and collects training instances that map sample SL sentences with sample TL sentences. Specifically, the sample sentences are mapped such that SL terms are mapped to TL terms. In this manner, a number of alignment matrices may be generated that identify terms in the parsed SL string 330 that may be reordered with regard to a formal syntax.

In one embodiment, a training instance comprises a pair of SL phrases and a label that communicates whether the phrases are in a correct order or inverted with respect to a formal syntax. In another embodiment, the collected training data allows the preprocessing module 341 to identify individual terms that could be reordered in a term sequence in accordance with the SL syntax. In an alternative embodiment, the training data may enable the preprocessing module 341 to identify the number of instances of a particular SL term in a SL string, and whether the number of instances of this term needs to be increased or decreased based on the syntax associated with a TL.

It is therefore understood that the preprocessing module 341 may be configured to compare one or more phrase orders found in a SL sentence with a phrasal paradigm associated with a TL, and then generate training instances based on these phrase orders such that the training instances are configured to identify syntactic orders associated with various word phrases.

In one embodiment, an alignment model is used to map the occurrence of a SL term in a SL sentence to a corresponding TL term in a TL sentence. For each sentence pair, the word alignment model produces an alignment matrix, wherein SL terms are mapped to corresponding TL terms, and the respective orientations of the matching pairs create a number of training samples that the preprocessing module 341 can use in the reordering process. In this manner, a number of training instances are generated wherein each training instance comprises a pair of SL phrases and a pair of TL phrases, and which identifies an order of these SL phrases.

In another embodiment, the alignment model is configured such that matching pairs of terms in an alignment matrix are enumerated and arranged in an order in which the terms might appear in a TL string, according to syntactic rules associated with a TL. Due to a syntactic alignment inherent in the matrix, the monotonicity associated with SL and TL counterparts may be assessed in order to generate a model for reordering terms and phrases in the parsed SL string 330. Furthermore, the training instance acquisition process 343 may be applied to a great number of nodes associated with the sentence pairs in the training data in order to collect a comprehensive set of training samples, which may allow a more accurate and effective body of reordering knowledge to be generated.

With reference now to FIG. 3, an exemplary alignment matrix 400 for aligning SL terms with TL terms in accordance with an embodiment of the present technology is shown. The alignment matrix 400 allows a string of SL terms 410 to be compared with an associated string of TL terms 420. Individual SL terms are mapped to corresponding TL terms in order to acquire training samples for reordering parse tree nodes associated with the input SL string 310. In addition, matching terms are further enumerated for identification and reordering purposes.

In the example shown in FIG. 3, a first SL phrase 412 and a second SL phrase 413 are two word strings covered by two nodes in the parse tree of the string of SL terms 410. The first SL phrase 412 is mapped to a first TL phrase 422, while the second SL phrase 413 is mapped to a second TL phrase 423. The fact that the first SL phrase 412 precedes the second SL phrase 413 in the string of SL terms 410 whereas the TL counterparts of the two phrases are in the reverse order implies that the first SL phrase 412, the second SL phrase 413, the first TL phrase 422 and the second TL phrase 423 constitute a single training instance of phrase inversion.
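A minimal sketch of how such a training instance might be labeled from an alignment matrix is given below. It assumes that the matrix is stored as a set of (SL index, TL index) links and that each SL phrase is given as an inclusive (start, end) span of word positions; the helper names `tl_span` and `label_instance` are hypothetical, and the labels follow the IN-ORDER and INVERSED values used in the present description.

```python
def tl_span(sl_span, links):
    """Return the (min, max) TL positions aligned to an inclusive SL span."""
    start, end = sl_span
    positions = [j for i, j in links if start <= i <= end]
    return (min(positions), max(positions)) if positions else None


def label_instance(first_sl, second_sl, links):
    """Label a pair of SL phrases by the order of their TL counterparts.

    Returns "IN-ORDER", "INVERSED", or None when a phrase is unaligned or
    the TL counterparts overlap (no usable training instance).
    """
    a = tl_span(first_sl, links)
    b = tl_span(second_sl, links)
    if a is None or b is None:
        return None
    if a[1] < b[0]:
        return "IN-ORDER"
    if b[1] < a[0]:
        return "INVERSED"
    return None


# SL "hombre alto" aligned with TL "tall man": hombre-man, alto-tall.
links = {(0, 1), (1, 0)}
label = label_instance((0, 0), (1, 1), links)
```

In this toy case the first SL phrase precedes the second, but their TL counterparts appear in the reverse order, so the pair is labeled as an instance of phrase inversion.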

In one embodiment, the alignment matrix 400 may be used not only to identify one or more phrases of interest, but may also be used to identify individual terms of interest. For example, with reference to the second reordering possibility 440, a single instance of the term “china” occurs in the string of SL terms 410 while two instances of the same term are identified in the associated string of TL terms 420. Therefore, a third training instance is identified that will communicate to the preprocessing module 341 that one or more terms in the parsed SL string 330 may need to occur multiple times in a reordered string.

However, in certain cases, a TL equivalent of a particular SL term in a SL string may not be present in a corresponding translation of the SL string. Therefore, a training instance may be accessed that communicates to the preprocessing module 341 that one or more SL terms may be omitted from a reordered SL string.

The alignment matrix 400 in FIG. 3 has been presented merely to illustrate certain principles of the present technology. Indeed, this alignment matrix 400 is only one example of how such a matrix might be configured and implemented according to principles of the present technology. Other methods of generating training samples and analyzing nuances in syntactic phrase and string structure may be implemented.

With reference still to FIG. 2, the preprocessing module 341 next implements a reordering knowledge formation process 344 in which the acquired training samples are analyzed and processed to form a reordering knowledge 345 that the preprocessing module 341 will utilize during the SL term reordering process. It is understood, however, that various methods exist for forming such reordering knowledge 345 from a set of acquired training instances.

For example, a probabilistic reordering calculation could be implemented to determine whether one or more SL terms may be reordered in a new SL term sequence pursuant to a formal syntax. As a second example, a reordering probability could be calculated in order to identify a likelihood that two or more child nodes associated with a parent node are to be inverted with respect to a syntactic phrase structure.

In another embodiment, the reordering knowledge formation process 344 involves estimating a probabilistic distribution over all training samples. For instance, a maximum entropy (ME) model could be used to estimate such a probabilistic distribution over pairs of SL phrases, represented by specific features, in the parsed SL string 330. In an exemplary application of principles of the present technology, if the reordering knowledge 345 is generated using a maximum entropy (ME) model, then a reordering probability may be calculated or represented as:

${{P\left( {rf} \right)} = \frac{\exp\left( {\sum\limits_{i}{\lambda_{i}{f_{i}\left( {N,r} \right)}}} \right)}{\sum\limits_{r^{\prime}}{\exp\left( {\sum\limits_{i}{\lambda_{i}{f_{i}\left( {N,r^{\prime}} \right)}}} \right)}}},$

where r

{IN-ORDER, INVERSED}, and where ƒ_(i)'s and λ_(i)'s are features andfeature weights, respectively, used in the maximum entropy (ME) model.
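By way of illustration, the maximum entropy calculation above can be sketched as a normalized exponential over the two labels. The sketch assumes that N denotes the parse tree node under consideration, which the text does not state explicitly; the feature function `noun_adj_inverted`, the dictionary layout of the node, and the weight 2.0 are hypothetical values chosen only to make the example concrete.

```python
import math

LABELS = ("IN-ORDER", "INVERSED")


def reorder_probability(node, label, features, weights):
    """Probability of `label` for `node` under a maximum entropy model.

    features: list of functions f_i(node, r) returning a number.
    weights:  list of feature weights lambda_i, one per feature.
    """
    def score(r):
        return sum(w * f(node, r) for f, w in zip(features, weights))

    scores = {r: score(r) for r in LABELS}
    # Subtract the maximum score before exponentiating for numerical stability.
    top = max(scores.values())
    normalizer = sum(math.exp(s - top) for s in scores.values())
    return math.exp(scores[label] - top) / normalizer


def noun_adj_inverted(node, r):
    # Hypothetical feature: fires when a noun followed by an adjective
    # is labeled as inverted.
    fires = node["left"] == "N" and node["right"] == "ADJ" and r == "INVERSED"
    return 1.0 if fires else 0.0


node = {"left": "N", "right": "ADJ"}
p_inverted = reorder_probability(node, "INVERSED", [noun_adj_inverted], [2.0])
p_in_order = reorder_probability(node, "IN-ORDER", [noun_adj_inverted], [2.0])
```

With a single feature of weight 2.0, the inverted label receives probability exp(2) / (exp(2) + 1), and the two probabilities sum to one as required by the normalization in the denominator.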

When accessing training samples during the training instance acquisition process 343, terms from two distinct phrases may sometimes overlap due to a mistake of an implemented word alignment process. For instance, two different phrases may contain a common term, but use that term in a different term grouping, or in a different context. This may have a detrimental effect on the reordering process where all points in the alignment matrices of a training corpus are collected in order to calculate a probabilistic distribution associated with a specific SL term sequence. Therefore, in another embodiment, the reordering knowledge formation process 344 is configured such that the preprocessing module 341 will ignore those phrases that overlap in order to improve the quality of the reordering knowledge 345.

Although implementation of the aforementioned embodiment may reduce the available amount of training data, it is further understood that such data sparseness may be remedied by removing or ignoring alignment points in the matrix that are less probable so as to minimize the occurrence of overlapping phrases. For instance, after removing a particular alignment point, one of the phrases may become shorter such that there is a smaller probability of two phrases overlapping. Thus, in an alternative embodiment, the preprocessing module 341 is configured to redefine one or more pairs of overlapping phrases by iteratively removing less probable word alignments until these phrases no longer overlap. If after this redefining process two or more phrases continue to overlap, then a node corresponding to the overlapping phrases is not associated with a training instance.
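The iterative redefinition described above could be sketched as follows. The sketch assumes that each alignment point carries a probability, that the least probable point touching either phrase is removed first, and that the process gives up once a phrase has no alignment points left; the function name `resolve_overlap` and these stopping rules are illustrative assumptions rather than details taken from the present description.

```python
def resolve_overlap(first_sl, second_sl, link_probs):
    """Remove weak alignment points until two phrases stop overlapping.

    first_sl, second_sl: inclusive (start, end) spans of SL word positions.
    link_probs: dict mapping (sl_index, tl_index) to an alignment probability.
    Returns the surviving set of links, or None if no instance can be formed.
    """
    links = dict(link_probs)

    def in_span(i, span):
        return span[0] <= i <= span[1]

    def tl_span(span):
        positions = [j for (i, j) in links if in_span(i, span)]
        return (min(positions), max(positions)) if positions else None

    while True:
        a, b = tl_span(first_sl), tl_span(second_sl)
        if a is None or b is None:
            return None  # a phrase lost all of its alignment points
        if a[1] < b[0] or b[1] < a[0]:
            return set(links)  # TL counterparts no longer overlap
        involved = [k for k in links
                    if in_span(k[0], first_sl) or in_span(k[0], second_sl)]
        # Drop the least probable alignment point and try again.
        del links[min(involved, key=links.get)]


noisy = {(0, 1): 0.9, (1, 0): 0.8, (0, 0): 0.1}
cleaned = resolve_overlap((0, 0), (1, 1), noisy)
```

In this toy case the spurious low-probability point (0, 0) causes the two TL counterparts to overlap; removing it leaves a clean inversion that can be used as a training instance.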

Next, the preprocessing module 341 implements a reordering knowledge application process in which the reordering knowledge 345 is applied to the nodes of the parsed SL string 330 for the purpose of ordering terms and nodes into different orders. These reordered term sequences are then used to generate reordered SL strings 350 that can be further scrutinized to identify a preferred translation of the input SL string 310. It is understood, however, that various paradigms may be implemented to order the SL terms into different orders within the spirit and scope of the present technology.

In one embodiment, one or more of the nodes of the parsed SL string 330 are reordered, while the term sequence inherent in each node remains the same. For instance, the preprocessing module 341 could be configured to apply the reordering knowledge 345 to a parse tree corresponding to the parsed SL string 330 and identify multiple term sequences each comprising the nodes of the parsed SL string 330, but wherein the nodes are arranged in different orders. In an alternative embodiment, a node is comprised of two or more SL terms, and these terms are reordered based on a syntactic rule identified in the reordering knowledge 345.

In another embodiment, a node associated with the input SL string 310 is identified as being a parent node comprised of two or more nodal components. Furthermore, the reordering knowledge application process could comprise determining a likelihood of whether the nodal components, or children, of the parent node are to be inverted or reordered, such as by analyzing an inversion or reordering probability associated with the parent node. In this manner, the preprocessing module 341 could be configured to reorder the terms of one or more nodes based on a probabilistic assessment associated with one or more reordered forms of multiple nodal components.

To further illustrate, the reordering knowledge application process could comprise identifying a binary node, and its corresponding child nodes, associated with a parse tree of the parsed SL string 330. The preprocessing module 341 would calculate an inversion probability associated with the binary node, and determine how likely it is that the two child nodes of the binary node are to be inverted based on the inversion probability. Such an inversion probability could be generated, for example, according to a reordering estimation having the following form:

$\left. {Z\text{:}{X \cdot Y}}\Rightarrow\left\{ \begin{matrix}{{{X \cdot Y}->P_{r}} = p} \\{{{Y \cdot X}->P_{r}} = {1 - p}}\end{matrix} \right. \right.$

Here, Z represents a phrase having child nodes X and Y. A probability, p, that X and Y are not inverted in the corresponding TL phrase is calculated, along with a probability, 1−p, that these nodes are inverted.
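The binary reordering estimation above can be applied recursively to a small parse tree, as sketched below. The tuple-based tree encoding, the function name `expand`, the node labels, and the in-order probabilities 0.9 and 0.3 are hypothetical choices made for illustration; in practice the value of p for each node would come from the reordering knowledge.

```python
def expand(node, p_in_order):
    """Enumerate every reordering of a binary parse tree with its probability.

    node: a word (str) or a tuple (label, left_child, right_child).
    p_in_order: function mapping a node label to p, the probability that the
    two children keep their original order.
    """
    if isinstance(node, str):
        return [([node], 1.0)]
    label, left, right = node
    p = p_in_order(label)
    results = []
    for left_words, left_prob in expand(left, p_in_order):
        for right_words, right_prob in expand(right, p_in_order):
            # X . Y kept in order with probability p ...
            results.append((left_words + right_words, p * left_prob * right_prob))
            # ... or inverted to Y . X with probability 1 - p.
            results.append((right_words + left_words, (1 - p) * left_prob * right_prob))
    return results


tree = ("NP", "un", ("NBAR", "hombre", "alto"))
toy_p = {"NP": 0.9, "NBAR": 0.3}  # made-up values of p per node label
reorderings = expand(tree, toy_p.get)
by_string = {" ".join(words): prob for words, prob in reorderings}
best = max(by_string, key=by_string.get)
```

With these made-up values, four reorderings are produced and their probabilities sum to one; the highest-scoring sequence is “un alto hombre”, in which only the lower binary node has been inverted.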

It is understood, however, that the present technology is not limited to reordering child nodes associated with a binary parent node. Indeed, a probabilistic distribution could be calculated that may apply to parent nodes having any number of child nodes. For example, given that n! component sequences may exist for a parent node having n number of children, a reordering probability corresponding to a phrase comprised of three nodal components, p₁, p₂ and p₃, may be generated as:

$P_{r}(r) \times P_{r}(p_{1}) \times P_{r}(p_{2}) \times P_{r}(p_{3}),$

where r is one of six possible reordering patterns of the 3-ary parent node.
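The product above might be computed for every reordering pattern of an n-ary node as sketched below. The function name `nary_reorderings`, the representation of each child as a (words, probability) pair, and the uniform pattern probability of 1/6 are assumptions made for illustration only.

```python
from itertools import permutations


def nary_reorderings(children, pattern_prob):
    """Score every ordering of an n-ary node's children.

    children: list of (words, prob) pairs, one per child, where prob is the
    probability of that child's own internal ordering.
    pattern_prob: function mapping a permutation tuple r to P_r(r).
    """
    results = []
    for r in permutations(range(len(children))):
        words = []
        prob = pattern_prob(r)
        for index in r:
            child_words, child_prob = children[index]
            words = words + child_words
            prob *= child_prob
        results.append((words, prob))
    return results


children = [(["a"], 1.0), (["b"], 1.0), (["c"], 1.0)]
patterns = nary_reorderings(children, lambda r: 1.0 / 6.0)
```

For a 3-ary node this yields the six patterns mentioned above; with a uniform pattern probability and child probabilities of one, each pattern receives a score of 1/6.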

In yet another embodiment, the reordering knowledge application process comprises ranking the identified SL term sequences based on a scoring metric. For instance, a probability, $P_{r}(p \rightarrow p')$, of reordering terms in a phrase p into a new phrase p′ could be identified, and this reordering probability could be used to score a term sequence comprising the reordered phrase p′. It is understood that various types of such scoring metrics may be implemented, and that the present technology is not limited to any one type of probabilistic modeling.

In one embodiment, a scoring metric, such as the scoring paradigm previously described, is used to identify a group of possible term sequences that most closely approximate a term order associated with a preferred translation of the input SL string 310. For instance, 2n² possible reorderings may exist for a particular binary node if each of its two child nodes has n reorderings (n × n combinations of child reorderings, each with the children either in order or inverted), but because this number of reorderings may be relatively large, depending on the value of n, only the top-scored n reorderings are flagged or recorded so as to generate an “n-preferred” group of possible term sequences. Thus, the preprocessing module 341 could be configured to generate a limited number of reordered term sequences rather than a different sequence for every term order permutation of the identified SL terms. This would have the practical effect of conserving processing time and energy while simultaneously increasing the efficiency of a translation model.
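The pruning described above can be sketched for a single binary node as follows. The function name `combine_n_best`, the list-of-pairs representation of each child's reorderings, and the numeric probabilities are hypothetical; the score of each candidate is assumed to be the product of the node's ordering probability and the two child probabilities.

```python
import heapq


def combine_n_best(left, right, p_in_order, n):
    """Keep the n highest-scoring reorderings of a binary node.

    left, right: lists of (words, prob) reorderings for the two children.
    p_in_order: probability that the children keep their original order.
    Builds all 2 * len(left) * len(right) candidates, then prunes to n.
    """
    candidates = []
    for left_words, left_prob in left:
        for right_words, right_prob in right:
            child_prob = left_prob * right_prob
            candidates.append((left_words + right_words, p_in_order * child_prob))
            candidates.append((right_words + left_words, (1.0 - p_in_order) * child_prob))
    return heapq.nlargest(n, candidates, key=lambda c: c[1])


# Each child has n = 2 reorderings, so 2 * 2 * 2 = 8 candidates are scored.
left = [(["a"], 0.6), (["a2"], 0.4)]
right = [(["b"], 0.7), (["b2"], 0.3)]
kept = combine_n_best(left, right, 0.8, 2)
```

Only the two best of the eight candidates are carried upward, which keeps the number of sequences per node constant as the tree is traversed.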

TABLE 1
1. Receive SL string: “un hombre alto”
2. Reorder child nodes of SL string into different orders:
   “un hombre alto” → “un alto hombre”
   “un hombre alto” → “un hombre alto”

An example of a simple reordering schema will now be presented in order to illustrate an embodiment of the present technology. With reference to Table 1, above, a SL string is received that is comprised of multiple nodes. In the provided example, a short Spanish phrase is utilized for the sake of simplicity. The objective of the machine translation process is to translate the Spanish phrase “un hombre alto” into its English equivalent, “a tall man”.

Due to syntactic differences between the English and Spanish languages, a simple translation of each of the terms in the SL string would not provide a “preferred” translation. However, this problem is solved by providing multiple possible arrangements for the terms of the SL string, and scrutinizing these arrangements using a probabilistic assessment in order to identify an arrangement that presents a preferred sequence of terms that may be used to generate a preferred translation.

With reference still to Table 1, after the SL string, “un hombre alto”, is received, the terms of the string are ordered into two or more different sequences. In the example given, two possible term sequences are generated: “un alto hombre” and “un hombre alto”. The first sequence, “un alto hombre”, is a reordered string in which the terms “hombre” and “alto” have been inverted in the original term sequence of the SL string. Although the provided example utilizes a grammatically simple phrase for the sake of clarity, a greater number of reordered term sequences could be generated when translating longer SL strings.

The second sequence that is generated, “un hombre alto”, presents the terms of the SL string in the original term sequence. One reason for retaining the original term sequence as a possible term sequence during the machine translation process is that a TL string that corresponds to a SL string may present the translated SL terms in the same exact order as the received SL string. Therefore, filtering the original term sequence from the group of reordered term sequences may have the practical effect of degrading translation quality. However, for the sake of brevity and simplicity, various references throughout this Detailed Specification to one or more reordered strings, reordered sequences, or the like, may include a sequence of terms that corresponds to the term sequence of the received SL string, even though the terms in this sequence are arranged pursuant to the original sequencing schema.

TABLE 2
“un alto hombre” → P_r = 0.95
“un hombre alto” → P_r = 0.05

With reference now to Table 2, above, an exemplary inversion probability format is demonstrated with reference to the Spanish SL string sequences in Table 1. An inversion probability of 5% has been assigned to the string “un hombre alto”, which signifies that the probability, p, of the TL equivalents of the terms “hombre” and “alto” not being inverted in the corresponding TL string is relatively low. In contrast, an inversion probability of 95% has been assigned to the string “un alto hombre”, which communicates that the probability, 1−p, of the terms “hombre” and “alto” being inverted in the corresponding TL string is relatively high.

However, the aforementioned inversion probability format is merely one way of assessing reordering probabilities associated with reordered term sequences. Other methods of comparing term sequences may also be implemented.

Once a reordered term sequence has been identified, the preprocessing module 341 is then able to create a reordered SL string comprising terms from the parsed SL string 330 arranged in the order of the reordered sequence. In one embodiment, the preprocessing module 341 generates multiple SL strings 350 each comprising the terms of the input SL string 310, but wherein the terms in each reordered string are arranged in a different order. The reordered SL strings 350 may then be transmitted to a decoding or translation module, where the reordered SL strings 350 may be translated into corresponding TL strings. Furthermore, in another embodiment, the preprocessing module outputs a group of generated reordering probabilities 360 that may be used to provide a logical or mathematical basis for selecting one of these TL strings as a preferred translation of the input SL string 310.

String Translation and Selection

After the reordering stage 340 has finished processing the parsed SL string 330, the reordered SL strings 350 and corresponding reordering probabilities 360 are received by a decoding module 370. The decoding module 370 is configured to translate the reordered SL strings 350 into corresponding TL strings, and one of these TL strings may then be identified as a preferred translation of the input SL string 310. For instance, the decoder 370 could be configured to implement a monotonous translation paradigm pursuant to which the terms of the reordered SL strings 350 are translated without further reordering the terms. In this manner, the term sequence of a generated TL string mirrors the sequence of terms in its corresponding reordered SL string. It is understood, however, that a number of methods exist for decoding the reordered SL strings 350, and that the present technology is not limited to any particular decoding methodology.
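A toy sketch of translation without further reordering is given below. The word-for-word `lexicon` and the function `monotone_translate` are hypothetical simplifications; a practical decoder would typically translate phrases and score competing alternatives rather than perform a single dictionary lookup per term.

```python
def monotone_translate(reordered_terms, lexicon):
    """Translate each SL term in place, preserving the reordered term order."""
    return [lexicon[term] for term in reordered_terms]

# Hypothetical one-to-one lexicon for the running example.
lexicon = {"un": "a", "hombre": "man", "alto": "tall"}
translated = monotone_translate("un alto hombre".split(), lexicon)
```

Because the decoder does not move any terms, the word order of the output is determined entirely by the reordering stage.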

With reference still to the embodiment illustrated in FIG. 2, the decoder 370 receives the generated reordering probabilities 360 and utilizes this information to select a preferred translation of the input SL string 310. It is understood that different methods may be implemented for applying the aforementioned reordering probabilities during the decoding process. In one embodiment, the decoding module 370 assigns a decoding score to each of the TL strings based on a reordering probability corresponding to each individual string. For instance, a reordering probability, P_r(S→S′), of reordering the input SL string 310, represented as “S”, into a specific reordered SL string, represented as “S′”, could be calculated during the reordering stage 340 and then utilized by the decoding module 370 to generate a decoding score for a TL term sequence, represented as “T′”, corresponding to the specific reordered SL string, S′.

In one embodiment, the following log-linear paradigm is utilized to generate a decoding score for T′:

$\exp\left( \lambda_{r} \log P_{r}(S \rightarrow S^{\prime}) + \sum_{i} \lambda_{i} F_{i}(S^{\prime} \rightarrow T^{\prime}) \right).$

In this exemplary paradigm, the first term of the formula corresponds to the contribution of syntax-based reordering in an overall reordering scheme. The second portion of the model takes into account specific features, represented as “F_i”, used in the decoder. In addition, associated feature weights, represented as “λ_i”, such as weights identified during a process of minimum error rate training, are also taken into account during the scoring paradigm.
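The log-linear score can be computed directly from its definition, as in the following minimal sketch. The function name `decoding_score` and the example feature values and weights are assumptions for illustration; the specification does not enumerate the decoder features F_i or give values for the weights.

```python
import math

def decoding_score(p_reorder, lambda_r, features, weights):
    """exp(lambda_r * log P_r(S -> S') + sum_i lambda_i * F_i(S' -> T'))."""
    reordering_term = lambda_r * math.log(p_reorder)
    feature_term = sum(w * f for w, f in zip(weights, features))
    return math.exp(reordering_term + feature_term)

# Hypothetical feature values (e.g. log-probabilities) and weights.
score = decoding_score(p_reorder=0.95, lambda_r=1.0,
                       features=[-2.0, -1.0], weights=[0.5, 0.5])
```

With lambda_r set to 1 and no other features, the score reduces to the reordering probability itself, which shows how the weight lambda_r controls the influence of syntax-based reordering relative to the decoder features.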

After decoding scores have been assigned to the possible TL term sequences, one of the sequences may be selected based on its corresponding score. For instance, the calculated decoding scores could be compared to determine which of the possible TL term sequences, which correspond to the reordered SL strings 350, has been assigned the highest relative score. Once a TL term sequence has been identified as being a preferred translation of the input SL string 310, the decoding module 370 uses this term sequence to generate an output TL string 380. This output TL string 380 may then be accessed such that a preferred translation of the input SL string 310 may be acquired and implemented for various purposes.

TABLE 3
1. “un alto hombre” → P_r = 0.95
   “un hombre alto” → P_r = 0.05
2. “un alto hombre” → T′₁ = “a tall man”
   “un hombre alto” → T′₂ = “a man who is tall”
3. “a tall man” → Score (T′₁) = 0.0433
   “a man who is tall” → Score (T′₂) = 0.0083

With reference now to Table 3, above, an example of a string ranking and selection process is demonstrated with reference to the Spanish SL string sequences used in Table 1 and Table 2, supra. Reordering probabilities are calculated for each of the SL term sequences during a reordering stage 340, and a decoding module 370 generates TL term sequences that correspond to these SL sequences. In the present example, the SL phrases “un alto hombre” and “un hombre alto” are associated with their corresponding TL phrases, “a tall man” and “a man who is tall”, respectively.

It is understood that a ranking schema may also be implemented to identify one or more possible term sequences that most closely approximate a term sequence of a preferred translation. Such a ranking schema could be based, for example, on decoding scores that take into account the calculated reordering probabilities, and may be useful when a relatively large number of possible reordered word strings exist for a received SL string.

With reference still to Table 3, a decoding score is assigned to each of the TL strings. In the present example, the phrase “a tall man” is assigned a score of 0.0433, while the phrase “a man who is tall” is assigned a score of 0.0083. These scores are calculated by taking into account the previously calculated reordering probabilities. It is understood, however, that the specific numerical scores referenced in Table 3 have been arbitrarily chosen to demonstrate principles of an embodiment of the present technology, and are not meant to demonstrate a preference for any one scoring model over alternative paradigms.

After the decoding scores have been generated, they may then be compared in order to select a single TL sequence that will be utilized to generate the output TL string 380. For example, the numerical value of 0.0433 could be compared with the value of 0.0083 to determine that a relatively higher score has been assigned to the phrase “a tall man”. This phrase would then be utilized to generate a final TL string, represented as “T”.
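The comparison described above amounts to selecting the highest-scoring candidate. The sketch below uses the illustrative scores from Table 3; the variable names are hypothetical.

```python
# Decoding scores from Table 3 (arbitrary illustrative values).
decoding_scores = {
    "a tall man": 0.0433,
    "a man who is tall": 0.0083,
}

# Rank all candidates, then take the best one as the final TL string T.
ranked = sorted(decoding_scores, key=decoding_scores.get, reverse=True)
preferred = ranked[0]
```

Keeping the full ranked list, rather than only the maximum, also supports a ranking schema in which several high-scoring candidates are retained.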

Example Reordering Paradigm

An exemplary reordering knowledge acquisition and application process will now be discussed so as to further illustrate various concepts of the present technology. As previously stated, an SMT process may comprise two primary stages: a training stage during which a machine translation system builds up its own translation knowledge, and a translation stage during which a decoder is used to translate a received SL string into a corresponding TL string. The implementation of these stages pursuant to an embodiment of the present technology is explained herein.

The SMT process begins with a training process during which reordering knowledge is learned based on a set of training data. The training data comprises a comprehensive collection of sentence pairs, and may consist of a set of data used in ordinary SMT tasks. For each sentence pair in the training data, a process of word alignment is implemented wherein an alignment matrix of two corresponding sentences is provided. Thus, by going through all of the nodes on all parse trees of the sentence pairs in the training data, a comprehensive group of training instances may be collected. A training instance may comprise, for instance, a pair of SL phrases plus the label “IN-ORDER” or the label “INVERTED”.

Next, a parsing process is implemented, which provides a syntactic analysis, in the form of a parse tree, of the SL sentence. Utilizing the parse tree, the machine translation system analyzes every parent node having two or more child nodes. These child nodes correspond to two or more different SL phrases, and from the alignment matrix two corresponding TL phrases may be identified. At this point, it may be determined whether the order of the pair of SL phrases is the same as the order of the pair of TL phrases.
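The labeling of a single training instance can be sketched as follows. The representation of the alignment matrix as a set of (source index, target index) links, the function `label_pair`, and the decision to skip pairs whose target phrases overlap are all assumptions made for illustration; the alignment shown for “un hombre alto” and “a tall man” is likewise an illustrative guess at what a word aligner would produce.

```python
def label_pair(alignment, left_span, right_span):
    """Label two adjacent SL phrases as "IN-ORDER" or "INVERTED".

    alignment: set of (source_index, target_index) links.
    left_span, right_span: half-open (start, end) ranges of source indices.
    Returns None when a phrase is unaligned or the target phrases overlap.
    """
    def targets(span):
        return [t for s, t in alignment if span[0] <= s < span[1]]

    left_t, right_t = targets(left_span), targets(right_span)
    if not left_t or not right_t:
        return None
    if max(left_t) < min(right_t):
        return "IN-ORDER"
    if max(right_t) < min(left_t):
        return "INVERTED"
    return None

# un(0) hombre(1) alto(2)  <->  a(0) tall(1) man(2)
alignment = {(0, 0), (1, 2), (2, 1)}
noun_adj = label_pair(alignment, (1, 2), (2, 3))   # "hombre" vs "alto"
det_rest = label_pair(alignment, (0, 1), (1, 3))   # "un" vs "hombre alto"
```

Applying this check at every parent node of every parsed training sentence yields the collection of labeled instances described above.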

The training instances are then fed to a machine learning algorithm, such as a maximum entropy (ME) model, which is configured to automatically learn a set of knowledge given a comprehensive amount of data. This algorithm is used to calculate the probability of a particular case in view of a particular condition. During a reordering process, this algorithm is used to calculate the probability that two nodes may be inverted or not, given the “features” of these two nodes. It is understood, however, that there are many ways to define such features when representing a node.

It is further understood that the learning process of the machine learning algorithm is based on the acquired training instances, and that the generated ME model represents the reordering knowledge itself. Thus, the reordering knowledge may itself be a ME model, which is based on training data comprising SL sentences paired with TL sentences.
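As a rough sketch of how such a model might be learned, note that a maximum entropy model with two outcomes is equivalent to logistic regression. The code below trains one by stochastic gradient ascent on the log-likelihood. The feature strings (syntactic labels of the two nodes), the training counts, the learning rate, and the training procedure itself are assumptions for illustration; the specification leaves the feature definition and the training algorithm open.

```python
import math
from collections import defaultdict

def p_inverted(weights, features):
    """P(INVERTED | features) under a binary maximum entropy model."""
    z = sum(weights[f] for f in features)
    return 1.0 / (1.0 + math.exp(-z))

def train_maxent(instances, epochs=200, lr=0.1):
    """Fit feature weights by stochastic gradient ascent (assumed procedure)."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for features, label in instances:
            y = 1.0 if label == "INVERTED" else 0.0
            error = y - p_inverted(weights, features)
            for f in features:
                weights[f] += lr * error
    return weights

# Hypothetical training instances: node-label features plus a label.
n_adj = ["left=N", "right=Adj"]
d_nbar = ["left=D", "right=N'"]
instances = ([(n_adj, "INVERTED")] * 8 + [(n_adj, "IN-ORDER")] * 2
             + [(d_nbar, "IN-ORDER")] * 9 + [(d_nbar, "INVERTED")] * 1)
weights = train_maxent(instances)
```

After training, `p_inverted(weights, n_adj)` is well above one half and `p_inverted(weights, d_nbar)` is well below it, reflecting the made-up label frequencies in the toy data.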

After an ME model is generated, the model may then be applied. Given a source word string to be translated, a parse tree of the source word string is obtained from a parsing module. Each parent node having two or more child nodes is analyzed, and the probability of inverting and not inverting these child nodes is determined. All possible reordered forms, and associated reordering probabilities, of the parse tree may be obtained.

With reference now to FIG. 4, an exemplary reordering paradigm 500 for reordering terms in a received sequence of terms in accordance with an embodiment of the present technology is shown. The reordering paradigm 500 comprises parsing the received term sequence in order to create a parse tree 510 corresponding to the term sequence, wherein the parse tree 510 may be represented as an inverted tree structure comprising one or more parent nodes associated with a plurality of child nodes. A node corresponds to a phrase of the received term sequence. For example, the node N′ corresponds to the phrase “mujer alta”. It is understood, however, that although a Spanish word string has been selected for the illustrated embodiment, the present embodiment could also be used with other natural languages.

With reference still to FIG. 4, the terms of the received term sequence may be reordered into different possible term sequences by generating parse trees for each of these term sequences. In the illustrated embodiment, a first parse tree 520 is generated that corresponds to the term sequence “una mujer alta”, and a second parse tree 530 is generated that corresponds to the term sequence “mujer alta una”. Moreover, a third parse tree 540 is generated that corresponds to the term sequence “una alta mujer”, and a fourth parse tree 550 is generated that corresponds to the term sequence “alta mujer una”.

As shown in FIG. 4, the reordered source word strings can be obtained from the reordered trees. It is understood, however, that the example shown in FIG. 4 is relatively simple in that only four possible reordered forms are illustrated. Indeed, a greater number of possible reordered forms may exist for a sentence of average length.

With reference still to FIG. 4, reordering probabilities are calculated for each of the possible term sequences. For instance, a probability of reordering each group of child nodes corresponding to the same parent node could be calculated for each reordering possibility, and these reordering probabilities could be compiled as a set of probabilistic assessments 560. As shown in the set of probabilistic assessments 560, the probability of reordering the sequence of nodes D N′ as D N′ is determined to be 0.9, whereas the probability of reordering the sequence of nodes D N′ as N′ D is determined to be 0.1. Similarly, the probability of reordering the sequence of nodes N Adj as N Adj is determined to be 0.2, whereas the probability of reordering the sequence of nodes N Adj as Adj N is determined to be 0.8.

These probabilities can then be used to generate comprehensive reordering probabilities for the possible term sequences. For instance, a reordering probability P=0.9×0.2 is assigned to the term sequence “una mujer alta”, which corresponds to the first parse tree 520. In this manner, reordering probabilities can also be calculated for the reordered term sequences corresponding to the second parse tree 530, the third parse tree 540 and the fourth parse tree 550, respectively.
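The four sequence probabilities follow from multiplying the per-node probabilities in the set of probabilistic assessments 560, as the sketch below works through. The helper `realize` and the dictionary layout are hypothetical conveniences; the node probabilities are those given for FIG. 4.

```python
from itertools import product

# Per-node probabilities from the set of probabilistic assessments 560.
outer = {"D N'": 0.9, "N' D": 0.1}    # order of D and N'
inner = {"N Adj": 0.2, "Adj N": 0.8}  # order of N and Adj inside N'
words = {"D": "una", "N": "mujer", "Adj": "alta"}

def realize(outer_order, inner_order):
    """Build the term sequence for one choice of order at each parent node."""
    nbar = " ".join(words[s] for s in inner_order.split())
    return " ".join(nbar if s == "N'" else words[s] for s in outer_order.split())

sequence_probs = {
    realize(o, i): p_o * p_i
    for (o, p_o), (i, p_i) in product(outer.items(), inner.items())
}
ranked = sorted(sequence_probs, key=sequence_probs.get, reverse=True)
```

Under these values, “una alta mujer” (third parse tree 540) receives 0.9×0.8, “una mujer alta” (first parse tree 520) receives 0.9×0.2, “alta mujer una” (fourth parse tree 550) receives 0.1×0.8, and “mujer alta una” (second parse tree 530) receives 0.1×0.2.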

The possible reordered term sequences may next be ranked with respect to their corresponding reordering probabilities, and the most likely, or “n-preferred”, reordered sequences may be selected accordingly. These reordered word strings and their associated reordering probabilities are then routed to a decoder, such as the decoding module 370 of FIG. 2. The decoder will translate the reordered word strings into TL word sequences, and score these TL word sequences based on the reordering probabilities associated with the reordered SL word strings that correspond to the TL word sequences. One of the TL word sequences may then be selected, based on this scoring paradigm, as a preferred translation of the originally received SL term sequence.

Example Computer System Environment

With reference now to FIG. 5, portions of the technology for SMT processing may be comprised of computer-readable and computer-executable instructions that reside, for example, in computer-usable media of a computer system. That is, FIG. 5 illustrates one example of a type of computer that can be used to implement embodiments, which are discussed below, of the present technology for SMT processing.

FIG. 5 illustrates an exemplary computer system 100 used in accordance with embodiments of the present technology for SMT processing. It is appreciated that computer system 100 of FIG. 5 is exemplary only and that the present technology for SMT processing can operate on or within a number of different computer systems including general purpose networked computer systems, embedded computer systems, routers, switches, server devices, consumer devices, various intermediate devices/artifacts, stand alone computer systems, and the like. As shown in FIG. 5, computer system 100 is well adapted to having peripheral computer readable media 102 such as, for example, a floppy disk, a compact disc, and the like coupled therewith.

Computer system 100 of FIG. 5 includes an address/data bus 104 for communicating information, and a processor 106A coupled with bus 104 for processing information and instructions. As depicted in FIG. 5, computer system 100 is also well suited to a multi-processor environment in which a plurality of processors 106A, 106B, and 106C are present. Conversely, computer system 100 is also well suited to having a single processor such as, for example, processor 106A. Processors 106A, 106B, and 106C may be any of various types of microprocessors. Computer system 100 also includes data storage features such as a computer usable volatile memory 108, e.g. random access memory (RAM), coupled with bus 104 for storing information and instructions for processors 106A, 106B, and 106C.

Computer system 100 also includes computer usable non-volatile memory 110, such as read only memory (ROM), coupled with bus 104 for storing static information and instructions for processors 106A, 106B, and 106C. Also present in computer system 100 is a data storage unit 112 (e.g., a magnetic or optical disk and disk drive) coupled with bus 104 for storing information and instructions. Computer system 100 also includes an optional alphanumeric input device 114 including alphanumeric and function keys coupled with bus 104 for communicating information and command selections to processor 106A or processors 106A, 106B, and 106C. Computer system 100 also includes an optional cursor control device 116 coupled with bus 104 for communicating user input information and command selections to processor 106A or processors 106A, 106B, and 106C. Computer system 100 of the present embodiment also includes an optional display device 118 coupled with bus 104 for displaying information.

Referring still to FIG. 5, optional display device 118 of FIG. 5 may be a liquid crystal device, cathode ray tube, plasma display device or other display device suitable for creating graphic images and alphanumeric characters recognizable to a user. Computer system 100 may also include a graphical representation controller module 119 for enabling generation of graphical representations of portions of aggregated on-line security information from a plurality of sources.

Optional cursor control device 116 allows the computer user to dynamically signal the movement of a visible symbol (cursor) on display device 118. Many implementations of cursor control device 116 are known in the art including a trackball, mouse, touch pad, joystick or keys on alpha-numeric input device 114 capable of signaling movement of a given direction or manner of displacement. Alternatively, it will be appreciated that a cursor can be directed and/or activated via input from alpha-numeric input device 114 using keys and sequence commands.

Computer system 100 is also well suited to having a cursor directed by other means such as, for example, voice commands. Computer system 100 also includes an I/O device 120 for coupling computer system 100 with external entities. For example, in one embodiment, I/O device 120 is a modem for enabling wired or wireless communications between computer system 100 and an external network such as, but not limited to, the Internet.

Referring still to FIG. 5, various other components are depicted for computer system 100. Specifically, when present, an operating system 122, applications 124, modules 126, and data 128 are shown as typically residing in one or some combination of computer usable volatile memory 108, e.g. random access memory (RAM), and data storage unit 112. In one embodiment, the present technology for SMT processing, for example, is stored as an application 124 or module 126 in memory locations within computer usable volatile memory 108 and memory areas within data storage unit 112.

It is understood that computer system 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the present technology. Neither should computer system 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the computer system 100.

Embodiments of the present technology are operational with numerous other general-purpose or special-purpose computing system environments or configurations. Examples of well known computing systems, environments, and configurations that may be suitable for use with the present technology include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

Furthermore, embodiments of the present technology may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. Embodiments of the present technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-storage media including memory-storage devices.

Although electronic and software-based systems are discussed herein, they are merely examples of computing environments that might be utilized, and are not intended to suggest any limitation as to the scope of use or functionality of the present technology. Neither should such electronic systems be interpreted as having any dependency or relation to any one or combination of components illustrated in the disclosed examples.

Moreover, the present technology may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. In addition, the present technology may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer-storage media including memory-storage devices.

Although various embodiments of the present technology have been described with reference to specific structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

1. A method comprising: generating, by a computer, a plurality of word strings corresponding to a syntactic tree structure that represents a source word string in a natural source language; and selecting a preferred translation from translations of the plurality of word strings, the translations resulting from translating the plurality of word strings into a natural target language, the translating based on a plurality of alignment matrices that map sample source sentences in the natural source language to sample target sentences in the natural target language.
2. The method of claim 1 further comprising reordering, according to the alignment matrices, terms of the generated plurality of word strings with regard to a formal syntax.
3. The method of claim 2 wherein the translating is performed after the reordering.
4. The method of claim 2 wherein the reordering is further based on reordering probabilities calculated for each of the reordered plurality of word strings.
5. The method of claim 1 wherein the selecting is based on scores assigned to the translations of the plurality of word strings.
6. The method of claim 5 further comprising generating the assigned scores based on a log-linear paradigm.
7. The method of claim 1 further comprising building translation knowledge prior to the generating and selecting.
8. At least one memory device storing instructions that, when executed by a computer, cause computer to perform a method of statistical machine translation (SMT), the method comprising: generating, by a computer, a plurality of word strings corresponding to a syntactic tree structure that represents a source word string in a natural source language; and selecting a preferred translation from translations of the plurality of word strings, the translations resulting from translating the plurality of word strings into a natural target language, the translating based on a plurality of alignment matrices that map sample source sentences in the natural source language to sample target sentences in the natural target language.
9. The at least one memory device of claim 8, the method further comprising reordering, according to the alignment matrices, terms of the generated plurality of word strings with regard to a formal syntax.
10. The at least one memory device of claim 9 wherein the translating is performed after the reordering.
11. The at least one memory device of claim 9 wherein the reordering is further based on reordering probabilities calculated for each of the reordered plurality of word strings.
12. The at least one memory device of claim 8 wherein the selecting is based on scores assigned to the translations of the plurality of word strings.
13. The at least one memory device of claim 12, the method further comprising generating the assigned scores based on a log-linear paradigm.
14. The at least one memory device of claim 8, the method further comprising building translation knowledge prior to the generating and selecting.
15. A system comprising: a computer; the computer configured for generating, by a computer, a plurality of word strings corresponding to a syntactic tree structure that represents a source word string in a natural source language; and the computer further configured for selecting a preferred translation from translations of the plurality of word strings, the translations resulting from translating the plurality of word strings into a natural target language, the translating based on a plurality of alignment matrices that map sample source sentences in the natural source language to sample target sentences in the natural target language.
16. The system of claim 15, the computer further configured for reordering, according to the alignment matrices, terms of the generated plurality of word strings with regard to a formal syntax.
17. The system of claim 16 wherein the translating is performed after the reordering.
18. The system of claim 16 wherein the reordering is further based on reordering probabilities calculated for each of the reordered plurality of word strings.
19. The system of claim 15 wherein the selecting is based on scores assigned to the translations of the plurality of word strings.
20. The system of claim 19, the computer further configured for generating the assigned scores based on a log-linear paradigm.